
Improved Linear-Time Construction of Minimal Dominating Set via Mobile Agents

Chand, Prabhat Kumar, Molla, Anisur Rahaman

arXiv.org Artificial Intelligence

The use of autonomous agents to solve graph problems has recently attracted significant attention. Such agents, representing entities like self-driving cars, drones, robots, or distributed processes, combine two defining capabilities: they can perform local computations under strict memory constraints, and they can traverse networks, moving between nodes while retaining only limited information. A crucial observation in this model is that local computation cost is essentially negligible compared to movement, as in real-world scenarios where the cost of physical traversal (for example, a self-driving car traversing multiple cities) far outweighs local processing. Consequently, research in this area has focused on minimising movement while still enabling efficient solutions to classical graph problems. Several fundamental graph problems, such as computing minimal dominating sets and independent sets, leader election, spanning tree construction, and community detection, have been extensively studied both in the classical distributed model and, more recently, in the mobile-agent model. For instance, dominating set construction has been investigated in the mobile-agent setting [2] and refined in subsequent works [3, 4, 5], while the closely related maximal independent set (MIS) problem has also been explored [6]. The same framework has produced algorithms for spanning structures, including BFS trees [7, 8], MSTs [3, 5], and general spanning trees [9]. These developments have further led to increasingly efficient approaches for leader election.
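For a sense of the combinatorial object being constructed, the sketch below greedily builds a maximal independent set in a plain centralized setting; a classical fact is that every maximal independent set is also a minimal dominating set. This is only an illustration of the target structure, not the agent-based construction studied in the paper (the adjacency-dict representation and node names are assumptions):

```python
def greedy_mis(adj):
    """Greedily build a maximal independent set (MIS).

    adj maps each node to the set of its neighbours. Every MIS is also a
    minimal dominating set: each node is in the set or adjacent to it, and
    removing any member leaves that member itself undominated.
    """
    mis, blocked = set(), set()
    for v in sorted(adj):  # fixed order keeps the result deterministic
        if v not in blocked:
            mis.add(v)
            blocked.add(v)          # v cannot be picked again
            blocked.update(adj[v])  # neighbours of v are excluded
    return mis

# A 5-cycle: nodes 0..4, each adjacent to its two neighbours.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
dom = greedy_mis(cycle)
assert all(v in dom or cycle[v] & dom for v in cycle)  # dominating
assert all(not (cycle[v] & dom) for v in dom)          # independent
```

The mobile-agent algorithms in this line of work compute such sets with agents walking the graph under memory constraints; the point of the sketch is only what the output must satisfy.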


Practical and Stealthy Touch-Guided Jailbreak Attacks on Deployed Mobile Vision-Language Agents

Ding, Renhua, Yang, Xiao, Fang, Zhengwei, Luo, Jun, He, Kun, Zhu, Jun

arXiv.org Artificial Intelligence

Large vision-language models (LVLMs) enable autonomous mobile agents to operate smartphone user interfaces, yet vulnerabilities in their perception and interaction remain critically understudied. Existing research often relies on conspicuous overlays, elevated permissions, or unrealistic threat assumptions, limiting stealth and real-world feasibility. In this paper, we introduce a practical and stealthy jailbreak attack framework, which comprises three key components: (i) non-privileged perception compromise, which injects visual payloads into the application interface without requiring elevated system permissions; (ii) agent-attributable activation, which leverages input attribution signals to distinguish agent from human interactions and limits prompt exposure to transient intervals to preserve stealth from end users; and (iii) efficient one-shot jailbreak, a heuristic iterative deepening search algorithm (HG-IDA*) that performs keyword-level detoxification to bypass built-in safety alignment of LVLMs. Moreover, we developed three representative Android applications and curated a prompt-injection dataset for mobile agents. We evaluated our attack across multiple LVLM backends, including closed-source services and representative open-source models, and observed high planning and execution hijack rates (e.g., GPT-4o: 82.5% planning / 75.0% execution), exposing a fundamental security vulnerability in current mobile agents and underscoring critical implications for autonomous smartphone operation.


EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation

Yi, Biao, Hu, Xavier, Chen, Yurun, Zhang, Shengyu, Yang, Hongxia, Wu, Fan

arXiv.org Artificial Intelligence

To tackle increasingly complex tasks, recent research on mobile agents has shifted towards multi-agent collaboration. Current mobile multi-agent systems are primarily deployed in the cloud, leading to high latency and operational costs. A straightforward idea is to deploy a device-cloud collaborative multi-agent system, which is nontrivial, as directly extending existing systems introduces new challenges: (1) reliance on cloud-side verification requires uploading mobile screenshots, compromising user privacy; and (2) open-loop cooperation lacking device-to-cloud feedback, under-utilizing device resources and increasing latency. To overcome these limitations, we propose EcoAgent, a closed-loop device-cloud collaborative multi-agent framework designed for privacy-aware, efficient, and responsive mobile automation. EcoAgent integrates a novel reasoning approach, Dual-ReACT, into the cloud-based Planning Agent, fully exploiting cloud reasoning to compensate for limited on-device capacity, thereby enabling device-side verification and lightweight feedback. Furthermore, the device-based Observation Agent leverages a Pre-understanding Module to summarize screen content into concise textual descriptions, significantly reducing token usage and device-cloud communication overhead while preserving privacy. Experiments on Android-World demonstrate that EcoAgent matches the task success rates of fully cloud-based agents, while reducing resource consumption and response latency.


AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent

Fan, Jingru, Dang, Yufan, Wu, Jingyao, Li, Huatao, Yang, Runde, Yang, Xiyuan, Wang, Yuheng, Qian, Chen

arXiv.org Artificial Intelligence

With the rapid evolution of large language models and multimodal models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that should be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, APPs, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose mobile agent that operates across applications. AppCopilot operationalizes this position through an end-to-end pipeline spanning data collection, training, finetuning, efficient inference, and PC/mobile application. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables experiential adaptation, voice interaction, function calling, cross-APP and cross-device orchestration, and comprehensive mobile APP support. The system design incorporates profiling-driven optimization for latency and memory across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements on four dimensions: stronger generalization, higher precision of on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime. By articulating a cohesive position and a reference architecture that closes the loop from data collection and training to finetuning and efficient inference, this paper offers a concrete roadmap for general-purpose mobile agents and provides actionable guidance.


FOGMACHINE -- Leveraging Discrete-Event Simulation and Scene Graphs for Modeling Hierarchical, Interconnected Environments under Partial Observations from Mobile Agents

Ohnemus, Lars, Hantke, Nils, Weißer, Max, Furmans, Kai

arXiv.org Artificial Intelligence

Dynamic Scene Graphs (DSGs) provide a structured representation of hierarchical, interconnected environments, but current approaches struggle to capture stochastic dynamics, partial observability, and multi-agent activity. These aspects are critical for embodied AI, where agents must act under uncertainty and delayed perception. We introduce FOGMACHINE, an open-source framework that fuses DSGs with discrete-event simulation to model object dynamics, agent observations, and interactions at scale. This setup enables the study of uncertainty propagation, planning under limited perception, and emergent multi-agent behavior. Experiments in urban scenarios illustrate realistic temporal and spatial patterns while revealing the challenges of belief estimation under sparse observations. By combining structured representations with efficient simulation, FOGMACHINE establishes an effective tool for benchmarking, model training, and advancing embodied AI in complex, uncertain environments.
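For intuition about the discrete-event side of such a framework, the toy loop below (not FOGMACHINE's actual API; the `simulate` and `observe` names are invented for illustration) pops the earliest event from a priority queue and lets each handler schedule follow-up events, which is the core mechanic any discrete-event simulator shares:

```python
import heapq
import itertools

def simulate(initial_events, horizon):
    """Minimal discrete-event loop.

    initial_events: list of (time, handler) pairs. A handler takes the
    current time and returns a list of (delay, handler) follow-ups.
    Returns the timestamps at which events fired, up to the horizon.
    """
    counter = itertools.count()  # tie-breaker for events at equal times
    queue = [(t, next(counter), h) for t, h in initial_events]
    heapq.heapify(queue)
    trace = []
    while queue:
        t, _, handler = heapq.heappop(queue)
        if t > horizon:
            break
        trace.append(t)
        for delay, follow_up in handler(t):
            heapq.heappush(queue, (t + delay, next(counter), follow_up))
    return trace

# A hypothetical agent that takes an observation every 2 time units.
def observe(t):
    return [(2.0, observe)]

times = simulate([(0.0, observe)], horizon=5.0)
```

In a scene-graph setting, handlers would mutate node and edge state instead of merely recording timestamps, and partial observability would come from agents only reading the subgraph near their position.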


Poison Once, Control Anywhere: Clean-Text Visual Backdoors in VLM-based Mobile Agents

Wang, Xuan, Liang, Siyuan, Liu, Zhe, Yu, Yi, Liu, Aishan, Lu, Yuliang, Gao, Xitong, Chang, Ee-Chien

arXiv.org Artificial Intelligence

Mobile agents powered by vision-language models (VLMs) are increasingly adopted for tasks such as UI automation and camera-based assistance. These agents are typically fine-tuned using small-scale, user-collected data, making them susceptible to stealthy training-time threats. This work introduces VIBMA, the first clean-text backdoor attack targeting VLM-based mobile agents. The attack injects malicious behaviors into the model by modifying only the visual input while preserving textual prompts and instructions, achieving stealth through the complete absence of textual anomalies. Once the agent is fine-tuned on this poisoned data, adding a predefined visual pattern (trigger) at inference time activates the attacker-specified behavior (backdoor). Our attack aligns the training gradients of poisoned samples with those of an attacker-specified target instance, effectively embedding backdoor-specific features into the poisoned data. To ensure the robustness and stealthiness of the attack, we design three trigger variants that better resemble real-world scenarios: static patches, dynamic motion patterns, and low-opacity blended content. Extensive experiments on six Android applications and three mobile-compatible VLMs demonstrate that our attack achieves high success rates (ASR up to 94.67%) while preserving clean-task behavior (FSR up to 95.85%). We further conduct ablation studies to understand how key design factors impact attack reliability and stealth. These findings are the first to reveal the security vulnerabilities of mobile agents and their susceptibility to backdoor injection, underscoring the need for robust defenses in mobile agent adaptation pipelines.


MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation

Loo, Gowen, Liu, Chang, Yin, Qinghong, Chen, Xiang, Chen, Jiawei, Zhang, Jingyuan, Tian, Yu

arXiv.org Artificial Intelligence

Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agent framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.
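As a rough illustration of the retrieval step such a framework builds on (a toy bag-of-words sketch, not MobileRAG's actual InterRAG/LocalRAG/MemRAG modules; all names and documents here are invented), a user query can be scored against a small document store and the best hit prepended to the model's prompt:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "open the settings app and enable dark mode",
    "book a train ticket from the travel app",
]
query = "how do I turn on dark mode"
context = retrieve(query, docs)
prompt = f"Context: {context[0]}\nUser query: {query}"
```

Real systems replace the bag-of-words similarity with learned embeddings and an approximate-nearest-neighbour index, but the retrieve-then-augment shape of the prompt is the same.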


MobiAgent: A Systematic Framework for Customizable Mobile Agents

Zhang, Cheng, Feng, Erhu, Zhao, Xi, Zhao, Yisheng, Gong, Wangbo, Sun, Jiahui, Du, Dong, Hua, Zhichao, Xia, Yubin, Chen, Haibo

arXiv.org Artificial Intelligence

With the rapid advancement of Vision-Language Models (VLMs), GUI-based mobile agents have emerged as a key development direction for intelligent mobile systems. However, existing agent models continue to face significant challenges in real-world task execution, particularly in terms of accuracy and efficiency. To address these limitations, we propose MobiAgent, a comprehensive mobile agent system comprising three core components: the MobiMind-series agent models, the AgentRR acceleration framework, and the MobiFlow benchmarking suite. Furthermore, recognizing that the capabilities of current mobile agents are still limited by the availability of high-quality data, we have developed an AI-assisted agile data collection pipeline that significantly reduces the cost of manual annotation. Compared to both general-purpose LLMs and specialized GUI agent models, MobiAgent achieves state-of-the-art performance in real-world mobile scenarios.


Synesthesia of Machines (SoM)-Based Task-Driven MIMO System for Image Transmission

Li, Sijiang, Zhang, Rongqing, Cheng, Xiang, Tang, Jian

arXiv.org Artificial Intelligence

To support cooperative perception (CP) of networked mobile agents in dynamic scenarios, the efficient and robust transmission of sensory data is a critical challenge. Deep learning-based joint source-channel coding (JSCC) has demonstrated promising results for image transmission under adverse channel conditions, outperforming traditional rule-based codecs. While recent works have explored combining JSCC with the widely adopted multiple-input multiple-output (MIMO) technology, these approaches are still limited to the discrete-time analog transmission (DTAT) model and simple tasks. Given the limited performance of existing MIMO JSCC schemes in supporting complex CP tasks for networked mobile agents with digital MIMO communication systems, this paper presents a Synesthesia of Machines (SoM)-based task-driven MIMO system for image transmission, referred to as SoM-MIMO. By leveraging the structural properties of the feature pyramid for perceptual tasks and the channel properties of the closed-loop MIMO communication system, SoM-MIMO enables efficient and robust digital MIMO transmission of images. Experimental results have shown that compared with two JSCC baseline schemes, our approach achieves average mAP improvements of 6.30 and 10.48 across all SNR levels, while maintaining identical communication overhead. In the era of beyond fifth generation (B5G) and sixth generation (6G), a large number of mobile agents, including autonomous vehicles, unmanned aerial vehicles, and humanoid robots, will interact in real-time and execute diverse intelligent functions, revolutionizing industries and daily life. To enable diverse intelligent functionalities, such as decision-making and task execution, accurate environmental perception--encompassing the acquisition of object position, size, and category--is essential. Manuscript received 24 April 2025; revised 20 July 2025; accepted 26 August 2025.
This work was supported in part by the National Natural Science Foundation of China under Grant 62125101, Grant 62341101, and Grant 62271351; in part by the New Cornerstone Science Foundation through the XPLORER PRIZE. Rongqing Zhang is with Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China (email: rongqingz@tongji.edu.cn).


InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

Ai, Qihang, Bu, Pi, Cao, Yue, Wang, Yingyao, Gu, Jihao, Xing, Jingxuan, Zhu, Zekun, Jiang, Wei, Zheng, Zhicheng, Song, Jun, Jiang, Yuning, Zheng, Bo

arXiv.org Artificial Intelligence

Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents' capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose InquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves a 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. We will open-source all datasets, models, and evaluation codes to facilitate development in both academia and industry.